Red & White Wine Quality by Di Wang

Univariate Plots Section

## [1] 6497   14
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "color"
## 'data.frame':    6497 obs. of  14 variables:
##  $ X                   : int  4898 4895 4894 4893 4892 4891 4889 4883 4882 4880 ...
##  $ fixed.acidity       : num  6 6.6 6.2 6.5 5.7 6.1 6.8 5.5 5 6.6 ...
##  $ volatile.acidity    : num  0.21 0.32 0.21 0.23 0.21 0.34 0.22 0.32 0.235 0.34 ...
##  $ citric.acid         : num  0.38 0.36 0.29 0.38 0.32 0.29 0.36 0.13 0.27 0.4 ...
##  $ residual.sugar      : num  0.8 8 1.6 1.3 0.9 ...
##  $ chlorides           : num  0.02 0.047 0.039 0.032 0.038 0.036 0.052 0.037 0.03 0.046 ...
##  $ free.sulfur.dioxide : num  22 57 24 29 38 25 38 45 34 68 ...
##  $ total.sulfur.dioxide: num  98 168 92 112 121 100 127 156 118 170 ...
##  $ density             : num  0.989 0.995 0.991 0.993 0.991 ...
##  $ pH                  : num  3.26 3.15 3.27 3.29 3.24 3.06 3.04 3.26 3.07 3.15 ...
##  $ sulphates           : num  0.32 0.46 0.5 0.54 0.46 0.44 0.54 0.38 0.5 0.5 ...
##  $ alcohol             : num  11.8 9.6 11.2 9.7 10.6 ...
##  $ quality             : int  6 5 6 5 6 6 5 5 6 6 ...
##  $ color               : chr  "White" "White" "White" "White" ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100  
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.219   Mean   :0.5313  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality         color          
##  Min.   : 8.00   Min.   :3.000   Length:6497       
##  1st Qu.: 9.50   1st Qu.:5.000   Class :character  
##  Median :10.30   Median :6.000   Mode  :character  
##  Mean   :10.49   Mean   :5.818                     
##  3rd Qu.:11.30   3rd Qu.:6.000                     
##  Max.   :14.90   Max.   :9.000

The quality ratings follow a roughly normal, slightly skewed distribution. Most wines were rated 5 or 6; the lowest rating is 3 and the highest is 9. We plot the distribution of each individual factor and look for potential relationships.

At first glance, the following factors are approximately normally distributed: 1. fixed.acidity 2. volatile.acidity 3. density 4. pH

The following factors have a slightly skewed distribution, more like the quality distribution: 1. citric.acid 2. residual.sugar 3. chlorides 4. free.sulfur.dioxide 5. total.sulfur.dioxide 6. sulphates 7. alcohol
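One way to put a number on "approximately normal" versus "skewed" is the sample skewness. The sketch below uses base R and simulated values rather than the wine data; `sample_skewness` is a hypothetical helper, not a function from the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (near 0 for a symmetric distribution).
sample_skewness <- function(x) {
  x <- x[is.finite(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)  # simulated stand-ins, NOT the wine measurements
symmetric_like <- rnorm(5000, mean = 3.2, sd = 0.16)    # roughly bell-shaped, like pH
right_skewed   <- rlnorm(5000, meanlog = 1, sdlog = 1)  # long right tail, like residual.sugar

skew_symmetric <- sample_skewness(symmetric_like)
skew_right     <- sample_skewness(right_skewed)
print(c(symmetric = skew_symmetric, right_skewed = skew_right))
```

A value near zero suggests a symmetric shape, while a clearly positive value indicates a long right tail.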

Based on the variable descriptions, the 11 chemical factors can be grouped as follows: 1. Acids 2. Sugar 3. Alcohol 4. Chlorides 5. Sulphates

We will mainly examine these five groups and their relationship to quality.

Univariate Analysis

What is the structure of your dataset?

There are 6497 observations of 14 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality, color). Quality is an ordered, categorical, discrete variable, rated on a 0-10 scale by at least 3 wine experts. The observed values range only from 3 to 9, with a mean of 5.818 and a median of 6. X is the index number of each wine sample. Color is a categorical variable created for this analysis. All other variables are quantitative measures of the chemical content of the wine.
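Since quality is described as ordered and categorical, it can be stored as an ordered factor. The following is a minimal base R sketch on a toy data frame; the values and the `quality_ord` column name are illustrative assumptions, not part of the original dataset.

```r
# Toy data frame reusing two column names from the report (values are illustrative only).
wine <- data.frame(
  quality = c(6L, 5L, 6L, 7L, 3L, 9L),
  color   = c("White", "White", "Red", "Red", "Red", "White"),
  stringsAsFactors = FALSE
)

# Treat quality as an ordered categorical variable over the observed 3-9 range.
wine$quality_ord <- factor(wine$quality, levels = 3:9, ordered = TRUE)
wine$color <- factor(wine$color)

str(wine)
quality_counts <- table(wine$quality_ord)
print(quality_counts)
```

Keeping all levels from 3 to 9 means that ratings with no samples still appear, with a count of zero, in tables and plots.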

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality, and which factors affect it in red and white wine. I suspect that alcohol, residual.sugar and pH affect the quality. The other point of interest is the difference between red and white wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From the description of the variables, the pairs fixed.acidity and volatile.acidity, free.sulfur.dioxide and total.sulfur.dioxide, and alcohol and density are likely to be correlated.

Did you create any new variables from existing variables in the dataset?

Yes, ‘color’ is a newly created variable.
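A variable like this is typically added when two tables are stacked into one. The sketch below assumes separate red and white data frames; the frame names `red` and `white`, the label "Red", and the values are hypothetical stand-ins (only "White" appears in the output above).

```r
# Hypothetical stand-ins for the separate red and white tables (two columns only).
red   <- data.frame(alcohol = c(9.4, 9.8),       quality = c(5L, 5L))
white <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))

# Label each table, then stack them into one data frame.
red$color   <- "Red"
white$color <- "White"
wine <- rbind(red, white)

color_counts <- table(wine$color)
print(color_counts)
```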

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Factors such as residual.sugar and free.sulfur.dioxide have significant outliers. However, considering the units used, the outliers are acceptable and the data is already tidy.
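To make "significant outliers" concrete, the quartiles printed in the summary can be run through Tukey's 1.5 x IQR rule. This is a sketch of one common convention, not necessarily the criterion used in the original analysis.

```r
# Quartiles and maxima copied from the summary() output in this report.
quartiles <- data.frame(
  variable = c("residual.sugar", "free.sulfur.dioxide"),
  q1       = c(1.8, 17),
  q3       = c(8.1, 41),
  max      = c(65.8, 289)
)

# Tukey's rule: points above Q3 + 1.5 * IQR are flagged as potential outliers.
quartiles$iqr         <- quartiles$q3 - quartiles$q1
quartiles$upper_fence <- quartiles$q3 + 1.5 * quartiles$iqr
quartiles$max_flagged <- quartiles$max > quartiles$upper_fence

print(quartiles)
```

With these quartiles the upper fences are 17.55 for residual.sugar and 77 for free.sulfur.dioxide, so the maxima of 65.8 and 289 lie far beyond them.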

Bivariate Plots Section

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Density has a negative relationship with alcohol and a positive correlation with residual sugar. The correlation coefficients are -0.687 and 0.553, respectively.
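Coefficients of this kind can be obtained with `cor()` and `cor.test()` from base R. The sketch below runs on simulated data built from a toy model, so its numbers will not match the -0.687 and 0.553 reported here; the variable names and model coefficients are assumptions for illustration.

```r
set.seed(1)  # simulated stand-ins, NOT the wine measurements
n <- 500
alcohol        <- runif(n, 8, 15)    # % by volume
residual_sugar <- runif(n, 0.5, 20)  # g/L

# Toy model: density falls with alcohol and rises with sugar, plus noise.
wine_density <- 1.0 - 0.0012 * alcohol + 0.0004 * residual_sugar + rnorm(n, sd = 0.0008)

r_alcohol    <- cor(wine_density, alcohol)         # Pearson correlation by default
r_sugar      <- cor(wine_density, residual_sugar)
test_alcohol <- cor.test(wine_density, alcohol)    # adds a confidence interval and p-value

print(round(c(alcohol = r_alcohol, sugar = r_sugar), 3))
print(test_alcohol$conf.int)
```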

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  1. White wine tends to have more alcohol and residual sugar, and less acid, sulphates and chlorides.

  2. As assumed in the univariate analysis, some variables are intrinsically related. For example, free.sulfur.dioxide and total.sulfur.dioxide are positively related to each other, and pH has a negative relationship with the acids.

What was the strongest relationship you found?

The strongest relationship is between density and alcohol (R = -0.687), which makes sense because alcohol is less dense than water (about 49.3 lb/ft^3 versus 62.4 lb/ft^3).
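The two densities can be converted to g/cm^3 to compare them with the density column, assuming that column is recorded in g/cm^3 (the report does not state its unit). A short arithmetic sketch:

```r
# Conversion factor from lb/ft^3 to g/cm^3 (1 lb = 453.59237 g, 1 ft = 30.48 cm).
lb_per_ft3_to_g_per_cm3 <- 453.59237 / (30.48^3)

ethanol_g_cm3 <- 49.3 * lb_per_ft3_to_g_per_cm3
water_g_cm3   <- 62.4 * lb_per_ft3_to_g_per_cm3

print(round(c(ethanol = ethanol_g_cm3, water = water_g_cm3), 3))
```

The results are close to 0.79 and 1.00 g/cm^3. Under that unit assumption, the wine densities in the summary (0.9871 to 1.0390) sit near that of water, and a higher alcohol share pulls them down.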

Multivariate Plots Section

## Warning: Removed 1204 rows containing non-finite values (stat_smooth).
## Warning: Removed 1214 rows containing missing values (geom_point).

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this section, I found that the quality is related to alcohol, residual sugar, sulphates, chlorides, and acids.

Were there any interesting or surprising interactions between features?

The standards used to judge the quality of red wine and white wine are different. For red wine, both residual sugar and acidity have a positive relationship with quality. For white wine, however, both factors are negatively related to quality. Sulphates have a positive effect in red wine, but white wine is not sensitive to sulphates. Chlorides have a negative effect on both red and white wine, although there is a significant number of outliers.
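One way to check whether a factor acts in opposite directions for the two colors is to fit a separate line per color and compare the slopes. The sketch below uses simulated data with opposite slopes built in on purpose, so it only illustrates the technique and says nothing about the real wines.

```r
set.seed(7)  # simulated stand-ins, NOT the wine measurements
n <- 300
toy <- data.frame(
  color          = rep(c("Red", "White"), each = n),
  residual.sugar = runif(2 * n, 1, 15)
)

# Opposite slopes are built in purely to illustrate the technique.
true_slope  <- ifelse(toy$color == "Red", 0.10, -0.10)
toy$quality <- 5.5 + true_slope * (toy$residual.sugar - 8) + rnorm(2 * n, sd = 0.5)

# Fit one straight line per color and collect the estimated slopes.
slopes <- sapply(split(toy, toy$color), function(d) {
  coef(lm(quality ~ residual.sugar, data = d))[["residual.sugar"]]
})
print(round(slopes, 3))
```

Slopes with opposite signs for the two groups correspond to the kind of interaction described above.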


Final Plots and Summary

Plot One: Quality Distribution of Red & White Wine

The quality ratings follow a roughly normal, slightly skewed distribution. Most wines were rated 5 or 6. The lowest rating is 3 and the highest rating is 9.

Plot Two: Difference between Red & White Wine

This plot shows the differences between red and white wine. Red wine has more acids, sulphates and chlorides, less sugar, and slightly less alcohol.

Plot Three: Alcohol vs. Wine Quality

The relationship between alcohol content and quality follows a similar trend for red and white wine. Wines rated 5 have the lowest alcohol content. Overall, alcohol has a positive relationship with wine quality.
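A per-rating comparison like this can be tabulated by averaging alcohol within each color and quality group. The sketch below uses `aggregate()` on a handful of made-up rows; the values are illustrative assumptions only.

```r
# Made-up rows for illustration; NOT taken from the wine data.
toy <- data.frame(
  color   = c("Red", "Red", "Red", "Red", "White", "White", "White", "White"),
  quality = c(5L, 5L, 7L, 7L, 5L, 5L, 7L, 7L),
  alcohol = c(9.4, 9.8, 11.0, 11.6, 9.2, 9.6, 11.2, 12.0)
)

# Mean alcohol for every color-by-quality combination.
alcohol_by_group <- aggregate(alcohol ~ color + quality, data = toy, FUN = mean)
print(alcohol_by_group)
```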


Reflection

“The biggest difference between reds and whites is in how they’re made. The grapes used for red and white wines generally look very different—as you might imagine, red wine grapes are darker and have more pigment. When making white wine, typically the grapes are pressed and then just the juice is fermented.”1

The nature of the grapes and the winemaking process make the telling difference. Through the data, we looked into the differences between red and white wine in terms of their chemical content. Compared to red wine, white wine tends to have higher alcohol, more residual sugar, and less acid, sulphates and chlorides (probably because of the winemaking process).

Some factors affecting quality also differ between red and white wine. Residual sugar and acids contribute positively to the quality of red wine but lower the rating of white wine. Sulphates positively influence red wine quality, while white wine seems insensitive to them. For both wines, higher alcohol content tends to go with a higher quality rating.

After all, a quality rating is relatively subjective. Even experts have limits in distinguishing the tiny differences between samples, not to mention consumers. That is probably why most wines were rated 5 or 6. If more extreme cases (below 3 or greater than 8) could be gathered, I would be interested to see why those samples stand out.

Reference: 1. http://www.winespectator.com/drvinny/show/id/44697